
The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems

Abstract

A current trend in high-performance computing is to decompose a large linear algebra problem into batches containing thousands of smaller problems that can be solved independently, before collating the results. To standardize the interface to these routines, the community is developing an extension to the BLAS standard (the batched BLAS), enabling users to perform thousands of small BLAS operations in parallel whilst making efficient use of their hardware. We discuss the benefits and drawbacks of the current batched BLAS proposals and perform a number of experiments, focusing on GEMM, to explore their effect on performance. In particular, we analyze the effect of novel data layouts which, for example, interleave the matrices in memory to aid vectorization and prefetching of data. Utilizing these modifications, our code outperforms both MKL and CuBLAS by up to 6 times on the self-hosted Intel KNL (codenamed Knights Landing) and Kepler GPU architectures, for large numbers of DGEMM operations using matrices of size 2 × 2 to 20 × 20.
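
The interleaved layout mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes square matrices, full interleaving across the whole batch, and plain Python lists, and the function names (interleave, deinterleave, batched_gemm_interleaved) are hypothetical. The point it shows is that element (i, j) of every matrix in the batch sits in one contiguous run, so the innermost loop walks the batch with unit stride.

```python
def interleave(batch, n):
    """Pack n x n matrices (nested lists) so that element (i, j) of every
    matrix in the batch is stored contiguously. Assumed layout, for illustration."""
    count = len(batch)
    out = [0.0] * (n * n * count)
    for m, mat in enumerate(batch):
        for i in range(n):
            for j in range(n):
                out[(i * n + j) * count + m] = mat[i][j]
    return out


def deinterleave(buf, n, count):
    """Inverse of interleave: recover a list of n x n nested-list matrices."""
    return [[[buf[(i * n + j) * count + m] for j in range(n)]
             for i in range(n)]
            for m in range(count)]


def batched_gemm_interleaved(a, b, n, count):
    """Compute C_m = A_m * B_m for every matrix m, with all operands interleaved."""
    c = [0.0] * (n * n * count)
    for i in range(n):
        for j in range(n):
            c_off = (i * n + j) * count
            for k in range(n):
                a_off = (i * n + k) * count
                b_off = (k * n + j) * count
                # Unit-stride loop over the batch: in a compiled language this
                # is the loop that would map onto vector instructions.
                for m in range(count):
                    c[c_off + m] += a[a_off + m] * b[b_off + m]
    return c


if __name__ == "__main__":
    A = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
    B = [[[1, 0], [0, 1]], [[5, 6], [7, 8]]]
    C = batched_gemm_interleaved(interleave(A, 2), interleave(B, 2), 2, len(A))
    print(deinterleave(C, 2, len(A)))
```

In a conventional layout each small matrix is stored on its own, so a vector unit has little to work on for a 2 × 2 problem; with the batch index innermost, the available vector length is the batch size rather than the matrix dimension.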
